AITopics | backward transfer

Neural Information Processing Systems, Feb-7-2026, 10:15:28 GMT

0a3b6f64f0523984e51323fe53b8c504-Paper.pdf

continual learning, learning, nctl, (14 more...)

Country:

North America > United States (0.14)
Oceania > Australia > New South Wales (0.04)
North America > Canada (0.04)

Industry: Education > Educational Setting (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

David Lopez-Paz, Marc'Aurelio Ranzato

Gradient Episodic Memory for Continual Learning

Neural Information Processing Systems, Nov-21-2025, 13:57:16 GMT

One major obstacle towards AI is the poor ability of models to solve new problems quicker, and without forgetting previously acquired knowledge.

artificial intelligence, learning, machine learning, (16 more...)

Country:

North America > United States > Texas > Travis County > Austin (0.14)
North America > Canada > Ontario > Toronto (0.14)
North America > United States > California > Los Angeles County > Long Beach (0.04)

Industry:

Education (0.94)
Health & Medicine > Consumer Health (0.42)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

arXiv.org Artificial Intelligence, Nov-11-2025

Mixtures of SubExperts for Large Language Continual Learning

Kang, Haeyong

Adapting Large Language Models (LLMs) to a continuous stream of tasks is a critical yet challenging endeavor. While Parameter-Efficient Fine-Tuning (PEFT) methods have become a standard for this, they face a fundamental dilemma in continual learning. Reusing a single set of PEFT parameters for new tasks often leads to catastrophic forgetting of prior knowledge. Conversely, allocating distinct parameters for each task prevents forgetting but results in a linear growth of the model's size and fails to facilitate knowledge transfer between related tasks. To overcome these limitations, we propose an adaptive PEFT method referred to as Mixtures of SubExperts (MoSEs), a continual learning framework designed for minimal forgetting and efficient scalability. MoSEs integrate a sparse Mixture of SubExperts into the transformer layers, governed by a task-specific routing mechanism. This architecture allows the model to isolate and protect knowledge within dedicated SubExperts, thereby minimizing parameter interference and catastrophic forgetting. Crucially, the router can adaptively select and combine previously learned sparse parameters for new tasks, enabling effective knowledge transfer while ensuring that the model's capacity grows sublinearly. We evaluate MoSEs on the comprehensive TRACE benchmark datasets. Our experiments demonstrate that MoSEs significantly outperform conventional continual learning approaches in both knowledge retention and scalability to new tasks, achieving state-of-the-art performance with substantial memory and computational savings.

large language model, machine learning, natural language, (14 more...)

2511.06237

Country:

North America > United States (0.28)
Europe > Romania > Sud - Muntenia Development Region > Giurgiu County > Giurgiu (0.04)

Genre: Research Report > New Finding (0.46)

Industry: Education > Educational Setting (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Harmon, Jackson, Hochlehnert, Andreas, Bethge, Matthias, Prabhu, Ameya

Mapping Post-Training Forgetting in Language Models at Scale

arXiv.org Artificial Intelligence, Oct-21-2025

Scaled post-training now drives many of the largest capability gains in language models (LMs), yet its effect on pretrained knowledge remains poorly understood. Not all forgetting is equal: forgetting one fact (e.g., a U.S. president or an API call) is not "averaged out" by recalling another. Hence, we propose a sample-wise paradigm to measure what is forgotten and when backward transfer occurs. Our metric counts 1->0 transitions (correct before post-training, incorrect after) to quantify forgetting and 0->1 transitions (incorrect before, correct after) to quantify backward transfer. Traditional task averages conflate these effects and obscure large changes. For multiple-choice benchmarks, we add chance-adjusted variants that subtract the expected contribution of random guessing from pre- and post-training accuracies. We apply this framework across post-training stages, model sizes, and data scales. Our large-scale analysis shows that: (1) Domain-continual pretraining induces moderate forgetting with low-to-moderate backward transfer; (2) RL/SFT post-training applied to base models and instruction tuning yields moderate-to-large backward transfer on math and logic with overall low-to-moderate forgetting; (3) Applying RL/SFT to instruction-tuned models is sensitive to data scale: at small scales, both forgetting and backward transfer are small; at larger scales, effects are mixed and warrant further study with better controls; (4) Model merging does not reliably mitigate forgetting. Overall, our framework offers a practical yardstick for mapping how post-training alters pretrained knowledge at scale, enabling progress towards generally capable AI systems.
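The sample-wise counting described in this abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions: per-sample correctness is assumed to be available as booleans before and after post-training, and the function name `transition_rates` and the normalization by the number of samples are hypothetical choices, not the authors' implementation. The chance-adjusted variants are not reproduced because the abstract does not give their formula.

```python
from typing import Dict, Sequence


def transition_rates(before: Sequence[bool], after: Sequence[bool]) -> Dict[str, float]:
    """Count per-sample transitions between two evaluations of the same samples.

    before[i] / after[i]: whether sample i was answered correctly
    before / after post-training (assumed input format).
    """
    if len(before) != len(after):
        raise ValueError("before and after must cover the same samples")
    n = len(before)
    if n == 0:
        raise ValueError("at least one sample is required")

    # 1->0 transitions: correct before, incorrect after (forgetting).
    forgotten = sum(1 for b, a in zip(before, after) if b and not a)
    # 0->1 transitions: incorrect before, correct after (backward transfer).
    gained = sum(1 for b, a in zip(before, after) if not b and a)

    return {
        # Dividing by n is an assumption made here for readability.
        "forgetting": forgotten / n,
        "backward_transfer": gained / n,
        # Change in plain accuracy; equals (gained - forgotten) / n.
        "accuracy_delta": (sum(map(bool, after)) - sum(map(bool, before))) / n,
    }


# Hypothetical data: one sample is forgotten and one is gained.
example = transition_rates(
    before=[True, True, False, False, True],
    after=[False, True, True, False, True],
)
print(example)
```

In this made-up example the plain accuracy does not change, although one sample was forgotten and another was newly answered correctly. That is the kind of offsetting change the abstract says task averages hide.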

large language model, machine learning, natural language, (15 more...)

2510.17776

Country: North America > United States > Minnesota (0.28)

Genre: Research Report (1.00)

Industry:

Education > Curriculum > Subject-Specific Education (1.00)
Government > Regional Government > North America Government > United States Government (0.48)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.95)

Neural Information Processing Systems, Oct-3-2025, 03:07:36 GMT

83da7c539e1ab4e759623c38d8737e9e-AuthorFeedback.pdf

We thank the reviewers for the constructive feedback. Code will be made public. Fig. (a, b, c) best viewed in zoom. See R3.1 for comparison between random selection and genetic algorithms. Our proposed RPS-Net consistently performs better across all budgets.

artificial intelligence, evolutionary algorithm, machine learning, (16 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Evolutionary Systems (0.40)

Neural Information Processing Systems, Oct-2-2025, 00:13:33 GMT

A Combinatorial Perspective on Transfer Learning

Instead of considering "online" and "continual" learning as inconvenient constraints to avoid, in

artificial intelligence, machine learning, nctl, (17 more...)

Country:

North America > United States (0.14)
Oceania > Australia > New South Wales (0.04)
North America > Canada (0.04)

Industry: Education > Educational Setting (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

arXiv.org Artificial Intelligence, Sep-30-2025

Omni-Thinker: Scaling Multi-Task RL in LLMs with Hybrid Reward and Task Scheduling

Li, Derek, Zhou, Jiaming, Brunswic, Leo Maxime, Ghaddar, Abbas, Sun, Qianyi, Ma, Liheng, Luo, Yu, Li, Dong, Coates, Mark, Hao, Jianye, Zhang, Yingxue

The pursuit of general-purpose artificial intelligence depends on large language models (LLMs) that can handle both structured reasoning and open-ended generation. We present Omni-Thinker, a unified reinforcement learning (RL) framework that scales LLMs across diverse tasks by combining hybrid rewards with backward-transfer-guided scheduling. Hybrid rewards integrate rule-based verifiable signals with preference-based evaluations from an LLM-as-a-Judge, enabling learning in both deterministic and subjective domains. Our scheduler orders tasks according to accuracy backward transfer (BWT), reducing forgetting and improving multi-task performance. Experiments across four domains show gains of 6.2% over joint training and 12.4% over model merging. Moreover, we demonstrate that simple assumptions on accuracy transfer yield accurate predictions of curriculum outcomes, with entropy dynamics explaining deviations due to generative tasks. These findings underscore the importance of BWT-aware scheduling and hybrid supervision for scaling RL-based post-training toward general-purpose LLMs.
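The abstract does not state how accuracy backward transfer is computed or how the scheduler uses it. For orientation, the sketch below computes the BWT metric as defined in Gradient Episodic Memory for Continual Learning (listed on this page): given a matrix R where R[i][j] is the test accuracy on task j after training on tasks 0..i, BWT is the mean of R[T-1][i] - R[i][i] over the first T-1 tasks. The function name `backward_transfer` and the example matrix are hypothetical, and this may differ from what Omni-Thinker actually uses.

```python
from typing import Sequence


def backward_transfer(R: Sequence[Sequence[float]]) -> float:
    """BWT following the Gradient Episodic Memory definition.

    R[i][j]: accuracy on task j after sequentially training on tasks 0..i.
    Negative values indicate forgetting of earlier tasks; positive values
    indicate that later training improved earlier tasks.
    """
    T = len(R)
    if T < 2:
        raise ValueError("BWT needs at least two tasks")
    if any(len(row) != T for row in R):
        raise ValueError("R must be a square T x T matrix")

    # Compare final accuracy on each earlier task with the accuracy
    # measured right after that task was learned.
    return sum(R[T - 1][i] - R[i][i] for i in range(T - 1)) / (T - 1)


# Made-up accuracies for three tasks trained in sequence.
R_example = [
    [0.90, 0.10, 0.10],
    [0.80, 0.85, 0.20],
    [0.70, 0.80, 0.90],
]
print(backward_transfer(R_example))
```

For the made-up matrix the result is negative, since accuracy on both earlier tasks dropped after the final task was learned.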

arxiv preprint arxiv, large language model, natural language, (15 more...)

2507.14783

Genre: Research Report > New Finding (1.00)

Industry: Education (0.68)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

arXiv.org Artificial Intelligence, Aug-22-2025

High-dimensional Asymptotics of Generalization Performance in Continual Ridge Regression

Zhao, Yihan, Su, Wenqing, Yang, Ying

Continual learning is motivated by the need to adapt to real-world dynamics in tasks and data distribution while mitigating catastrophic forgetting. Despite significant advances in continual learning techniques, the theoretical understanding of their generalization performance lags behind. This paper examines the theoretical properties of continual ridge regression in high-dimensional linear models, where the dimension is proportional to the sample size in each task. Using random matrix theory, we derive exact expressions of the asymptotic prediction risk, thereby enabling the characterization of three evaluation metrics of generalization performance in continual learning: average risk, backward transfer, and forward transfer. Furthermore, we present the theoretical risk curves to illustrate the trends in these evaluation metrics throughout the continual learning process. Our analysis reveals several intriguing phenomena in the risk curves, demonstrating how model specifications influence the generalization performance. Simulation studies are conducted to validate our theoretical findings.

artificial intelligence, machine learning, matrix, (11 more...)